Production

The first plot shows all the production indicators from both the current studies and the original framework in the y-axis. Orange indicates that the indicator is only being used in the current studies, purple that it is only included in the Wiltshire framework, and green that the indicator is used in both the framework and current studies.

The x-axis shows the number of secondary data metrics that have been collected to represent those indicators. You can see that there are some indicators for which there exist many data, but many indicators for which I have found little to represent them.

Value-added market indicators are pulled from various NASS, as are the total quantity of food and forest products and production inputs. There is plenty more that might be pulled from NASS here. Imports and exports are from the Economic Research Service. The exports data are far more detailed than the imports. The former are disaggregated by category at the state level (fresh fruit, processed fruit, dairy…) which is why there are a heap of metrics for it. The import data is weak - I could only find the value of the top five agricultural imports for each state, not a total. Recalls are from FDA records, but I have not any helpful information the impact of recalls in terms of food safety. Crop diversity is represented in the richness indicator by the Cropland CROS data set, which provides estimates of the area of farmland devoted to specific crops across the US. I have disaggregated these at the county and state levels here.

You can see there is plenty more in the frameworks that are not represented by secondary data here, particularly related to the consumer side - marketability, nutrition, food waste, and safety. I suspect some of these indicators will migrate toward other dimensions in the refinement process as well. But this does help identify some gaps in the data.

Code
pacman::p_load(
  dplyr,
  ggplot2,
  stringr,
  plotly,
  RColorBrewer
)

# Load production tree with use notes
prod_tree <- read.csv('data/trees/prod_tree_with_use.csv')

# Counts of secondary data metrics
counts <- meta %>% 
  group_by(Indicator) %>% 
  dplyr::summarize(count = n())

# Join to Wiltshire framework
colors <- RColorBrewer::brewer.pal(n = 3, name = 'Dark2')
dat <- full_join(prod_tree, counts, by = join_by(Indicator == Indicator)) %>% 
  arrange(Indicator) %>% 
  mutate(
    count = ifelse(is.na(count), 0, count),
    label_color = case_when(
      Use == 'both' ~ colors[1],
      Use == 'wiltshire' ~ colors[3],
      Use == 'current' ~ colors[2]
    )
  )
# [1] "#1B9E77" "#D95F02" "#7570B3"

# Plot
dat %>%
  ggplot(aes(x = Indicator, y = count)) +
  geom_col(
    color = 'black',
    fill = 'grey'
  ) +
  geom_point(
    data = dat,
    aes(x = 1, y = 1, color = Use),
    inherit.aes = FALSE,
    alpha = 0,
    size = -1
  ) +
  scale_color_manual(
    name = "Indicator Use:",
    values = c(
      "both" = colors[1],
      "wiltshire" = colors[2],
      "current" = colors[3]
    ),
    labels = c(
      'Both',
      'Wiltshire Only',
      'Current Only'
    )
  ) +
  theme_classic() +
  theme(
    axis.text = element_text(size = 12),
    axis.text.y = element_text(color = rev(dat$label_color)),
    axis.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    legend.title = element_text(size = 12),
    legend.position = "bottom",
    plot.margin = margin(t = 10, r = 75, b = 10, l = 10)
  ) +
  guides(
    color = guide_legend(override.aes = list(size = 4, alpha = 1))
  ) +
  coord_flip() +
  labs(y = 'Secondary Data Count')

Bar Plot of Indicators

Otherwise, I won’t be diving into the usual PCA exploration for the production dataset because we have collected enough metrics to put together a mostly full, mostly coherent example framework with which we can try aggregating data. This should be coming in January.

1 Crop Diversity

I wanted to highlight this cropland data layer from USDA NASS in collaboration with USGS, NRCS, and FSA, among other agencies. It’s a crop-specific LULC layer derived from satellite imagery and ground-truthing. It seems to be about the best thrust at crop diversity across regions that I’ve found, but it also is certainly tailored toward primary crops, and may not represent New England very well. I’d love to hear thoughts on how useful this would be in New England.

Code
pacman::p_load(
  mapview,
  sf,
  stars,
  leaflet,
  leaflet.extras,
  leafpop
)
counties_sf <- readRDS('data/sm_data.rds')[['ne_counties_2024']]
fips_key <- readRDS('data/sm_data.rds')[['fips_key']]
crop <- readRDS('data/sm_data.rds')[['cropland_cros']]

counties <- left_join(counties_sf, fips_key)

div_map <- mapview(
  crop,
  zcol = '2023_30m_cdls',
  layer.name = 'Cropland Data Layer'
) + 
  mapview(
    counties,
    label = 'county_name',
    alpha.regions = 0
  )

div_map@map %>% 
  addFullscreenControl()

Crop diversity in New England

I went on to use this layer to calculate Shannon diversity for crop types at the county and state levels. Here is what it looks like:

Code
pacman::p_load(
  mapview,
  leaflet,
  stringr,
  sf
)
source('dev/data_pipeline_functions.R')

dat <- readRDS('data/sm_data.rds')

div <- dat$metrics %>% 
  filter(
    variable_name == 'cropDiversity',
    str_length(fips) == 5
  ) %>% 
  get_latest_year() %>% 
  mutate(value = round(as.numeric(value), 3))

div <- left_join(dat$ne_counties_2021, div)
mapview(
  div,
  zcol = 'value',
  label = 'value',
  layer.name = 'Crop Diversity'
)

Shannon diversity for crop production at county level.

Similarly, we could pull crop richness out of this dataset, but I have a feeling that the bias toward commodity crops would make that a bit more problematic.

2 Distribution Plots

2.1 By County

Note that while most of the available secondary data is at the county level, the environment dimension includes a fair amount at the state level as well. This includes greenhouse gas emissions and water quality surveys. For now, I’ll just show these separately, but some creative aggregation will have to happen eventually.

Code
pacman::p_load(
  dplyr,
  purrr,
  ggplot2,
  rlang,
  ggpubr,
  tidyr
)
source('dev/data_pipeline_functions.R')
source('dev/filter_fips.R')
metrics <- readRDS('data/sm_data.rds')[['metrics']]
metadata <- readRDS('data/sm_data.rds')[['metadata']]

# Use metadata to get help filter by dimension
prod_meta <- metadata %>%
  filter(dimension == 'production')

# Filter to economics dimension
prod_metrics <- metrics %>%
  filter(variable_name %in% prod_meta$variable_name)

# env_metrics$variable_name %>% unique
# get_str(env_metrics)

# Filter to latest year and new (post-2024) counties
# And pivot wider so it is easier to get correlations
prod_county <- prod_metrics %>%
  filter_fips(scope = 'counties') %>%
  get_latest_year() %>%
  select(fips, variable_name, value) %>%
  mutate(variable_name = str_split_i(variable_name, '_', 1)) %>%
  pivot_wider(
    names_from = 'variable_name',
    values_from = 'value'
  ) %>%
  unnest(!fips) %>%
  mutate(across(c(2:last_col()), as.numeric))

# Save temp file for use in analysis script
saveRDS(prod_county, 'data/temp/prod_county.rds')

## Plot
plots <- map(names(prod_county)[-1], \(var){
  if (is.character(prod_county[[var]])) {
    env_county %>%
      ggplot(aes(x = !!sym(var))) +
      geom_bar(
        fill = 'lightblue',
        color = 'royalblue',
        alpha = 0.5
      ) +
      theme_classic() +
      theme(plot.margin = unit(c(rep(0.5, 4)), 'cm'))
  } else if (is.numeric(prod_county[[var]])) {
    prod_county %>%
      ggplot(aes(x = !!sym(var))) +
      geom_density(
        fill = 'lightblue',
        color = 'royalblue',
        alpha = 0.5
      ) +
      theme_classic() +
      theme(plot.margin = unit(c(rep(0.5, 4)), 'cm'))
  } else {
    return(NULL)
  }
})


# Arrange them in 4 columns
ggarrange(
  plotlist = plots,
  ncol = 3,
  nrow = 4
)

Distributions of production metrics at the county level.

2.2 By State

Code
pacman::p_load(
  dplyr,
  purrr,
  ggplot2,
  rlang,
  ggpubr,
  tidyr
)

state_codes <- readRDS('data/sm_data.rds')[['fips_key']] %>%
  select(fips, state_code)

prod_state <- prod_metrics %>%
  filter_fips(scope = 'state') %>%
  get_latest_year() %>%
  select(fips, variable_name, value) %>%
  mutate(variable_name = str_split_i(variable_name, '_', 1)) %>%
  pivot_wider(
    names_from = 'variable_name',
    values_from = 'value'
  ) %>%
  unnest(!fips) %>%
  mutate(across(c(2:last_col()), as.numeric)) %>%
  left_join(state_codes, by = 'fips')

# Save temp data file for use in analysis script
saveRDS(prod_state, 'data/temp/prod_state.rds')

# Variables to map. 
vars <- names(prod_state)[-c(1, 43)]

## Plot
plots <- map(vars, \(var){
  prod_state %>%
    ggplot(aes(y = !!sym(var), x = state_code, color = state_code)) +
    geom_point(
      alpha = 0.5,
      size = 3
    ) +
    theme_classic() +
    theme(
      plot.margin = unit(c(rep(0.5, 4)), 'cm'),
      legend.position = 'none'
    ) +
    labs(
      x = 'State'
    )
})

# Arrange them in 4 columns
ggarrange(
  plotlist = plots,
  ncol = 4,
  nrow = 11
)

Distributions of production variables at state level

3 Bivariate Plots

Using a selection of variables at the county level.

Code
pacman::p_load(
  GGally
)

# Neat function for mapping colors to ggpairs plots
# https://stackoverflow.com/questions/45873483/ggpairs-plot-with-heatmap-of-correlation-values
map_colors <- function(data,
                       mapping,
                       method = "p",
                       use = "pairwise",
                       ...) {
  # grab data
  x <- eval_data_col(data, mapping$x)
  y <- eval_data_col(data, mapping$y)

  # calculate correlation
  corr <- cor(x, y, method = method, use = use)
  colFn <- colorRampPalette(c("blue", "white", "red"), interpolate = 'spline')
  fill <- colFn(100)[findInterval(corr, seq(-1, 1, length = 100))]

  # correlation plot
  ggally_cor(data = data, mapping = mapping, color = 'black', ...) +
    theme_void() +
    theme(panel.background = element_rect(fill = fill))
}

lower_function <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_point(alpha = 0.5) +
    geom_smooth(color = "blue", fill = "grey", ...) +
    theme_bw()
}

# Rename variables to be shorter
prod_county %>%
  select(-fips) %>% 
  ggpairs(
    upper = list(continuous = map_colors),
    lower = list(continuous = lower_function),
    axisLabels = 'show'
  ) +
  theme(
    strip.text = element_text(size =  5),
    axis.text = element_text(size =   5),
    legend.text = element_text(size = 5)
  )

4 Correlations

Only showing correlations by county because we don’t have enough observations to run it by state.

Code
pacman::p_load(
  dplyr,
  tidyr,
  tibble,
  stringr,
  purrr,
  tidyr,
  ggplot2,
  plotly,
  reshape,
  Hmisc,
  viridisLite
)

# get_str(env_county)

cor <- prod_county %>%
  select(-fips) %>%
  as.matrix() %>%
  rcorr()

# Melt correlation values and rename columns
cor_r <- melt(cor$r) %>%
  setNames(c('var_1', 'var_2', 'value'))

# Save p values
cor_p <- melt(cor$P)
p.value <- cor_p$value

# Make heatmap with custom text aesthetic for tooltip
plot <- cor_r %>%
  ggplot(aes(var_1, var_2, fill = value, text = paste0(
    'Var 1: ', var_1, '\n',
    'Var 2: ', var_2, '\n',
    'Correlation: ', format(round(value, 3), nsmall = 3), '\n',
    'P-Value: ', format(round(p.value, 3), nsmall = 3)
  ))) +
  geom_tile() +
  scale_fill_viridis_c() +
  theme(axis.text.x = element_text(hjust = 1, angle = 45)) +
  labs(
    x = NULL,
    y = NULL,
    fill = 'Correlation'
  )

# Convert to interactive plotly figure with text tooltip
ggplotly(
  plot,
  tooltip = 'text',
  width = 800,
  height = 500
)

Interactive correlation plot of metrics by county

Back to top